PostgreSQL Graph Architecture Proposal
1. Executive Summary
We propose shifting the GraphRAGEngine storage backend from **LanceDB (Edge List)** to **PostgreSQL (Relational Graph)**.
**Why?**
- **Statelessness**: Allows utilizing atomic SQL queries (Recursive CTEs) for traversal without loading the entire graph into RAM.
- **Infrastructure**: Reuses the existing PostgreSQL instance. No new infrastructure (Embedded Graph) or sticky volumes required.
- **Reliability**: ACID compliance for graph updates.
2. Proposed Schema
We will introduce 3 normalized tables to represent the property graph structure.
`graph_nodes`
Represents entities (People, Projects, Documents).
CREATE TABLE graph_nodes (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
workspace_id VARCHAR(255) NOT NULL,
name VARCHAR(512) NOT NULL,
type VARCHAR(50) NOT NULL, -- e.g., 'person', 'task'
description TEXT,
properties JSONB DEFAULT '{}', -- Flexible metadata
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW(),
-- Fast lookup
CONSTRAINT uniq_node_workspace_name UNIQUE (workspace_id, name, type)
);
CREATE INDEX idx_nodes_workspace ON graph_nodes(workspace_id);`graph_edges`
Represents relationships.
CREATE TABLE graph_edges (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
workspace_id VARCHAR(255) NOT NULL,
source_node_id UUID REFERENCES graph_nodes(id) ON DELETE CASCADE,
target_node_id UUID REFERENCES graph_nodes(id) ON DELETE CASCADE,
relationship_type VARCHAR(50) NOT NULL, -- e.g., 'manages', 'blocks'
weight FLOAT DEFAULT 1.0,
properties JSONB DEFAULT '{}',
created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_edges_source ON graph_edges(source_node_id);
CREATE INDEX idx_edges_target ON graph_edges(target_node_id);`graph_communities` (For GraphRAG)
Stores the hierarchical clusters detected by the Leiden algorithm.
CREATE TABLE graph_communities (
id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
workspace_id VARCHAR(255) NOT NULL,
level INT NOT NULL, -- Hierarchy level (0, 1, 2)
summary TEXT NOT NULL, -- LLM-generated summary
keywords JSONB, -- List of top keywords
created_at TIMESTAMP DEFAULT NOW()
);`community_membership`
Maps nodes to communities.
CREATE TABLE community_membership (
community_id UUID REFERENCES graph_communities(id) ON DELETE CASCADE,
node_id UUID REFERENCES graph_nodes(id) ON DELETE CASCADE,
PRIMARY KEY (community_id, node_id)
);3. Workflow & Logic
A. Ingestion (Write)
When a document is processed:
- **Extract**: LLM extracts
(Entity A, Relation, Entity B). - **Upsert Nodes**: Insert
Entity AandEntity Bintograph_nodes(ON CONFLICT DO NOTHING). - **Insert Edges**: Insert the relationship into
graph_edges.
B. Retrieval (Read) - "Stateless Graph"
Instead of hydrate_workspace (loading 10k nodes into RAM), we use **Recursive SQL**.
**Local Search (Traversal):**
- *Question*: "What does Sarah work on?"
- *SQL*: Find Node "Sarah" -> JOIN
graph_edges-> JOINgraph_nodes(Target). - *Depth*: Efficiently handled by Postgres for depths 1-3.
-- Example: Find everything Sarah touches (Depth 1)
SELECT target.name, edge.relationship_type
FROM graph_nodes source
JOIN graph_edges edge ON source.id = edge.source_node_id
JOIN graph_nodes target ON edge.target_node_id = target.id
WHERE source.name = 'Sarah' AND source.workspace_id = 'ws_123';C. Community Detection (Async Job)
Since we can't run Leiden/Louvain efficiently in SQL:
- **Job**: Periodically (e.g., hourly), a worker process fetches the full graph for a workspace.
- **Compute**: Loads into
NetworkX(RAM), runs Leiden algorithm. - **Summarize**: Generates summaries via LLM.
- **Save**: Writes results to
graph_communitiestable. - **Free RAM**: Process exits or clears memory.
4. Pros vs Cons
| Feature | Postgres Architecture | LanceDB/In-Memory (Legacy) |
|---|---|---|
| **Persistence** | ✅ Excellent (RDS) | ❌ Poor (Volatile RAM) |
| **Memory Usage** | ✅ **Low** (O(1) per query) | ❌ High (O(N) per workspace) |
| **Traversal Speed** | ⚠️ Medium (SQL Joins) | ✅ Instant (In-Memory) |
| **Complexity** | ✅ Low (Standard SQL) | ⚠️ Medium (Custom Engine) |
| **Vector Search** | ✅ Native (pgvector) | ✅ Native (LanceDB) |
*Note: We will KEEP LanceDB for pure text chunk vector search. Postgres is purely for the structural Metadata Graph.*
5. Migration Strategy
- **Definition**: Create SQLAlchemy models in
backend-saas/core/models.py. - **Migration**: Generate Alembic migration scripts.
- **Refactor**: Update
GraphRAGEngineto read/write from SQL models instead of internal dicts.